Efficient approximate planning in continuous space Markovian Decision Problems

Author

  • Csaba Szepesvári
Abstract

MDPs provide a clean and simple, yet fairly rich framework for studying various aspects of intelligence, such as planning. A well-known practical limitation of planning in MDPs is the curse of dimensionality [1], referring to the exponential growth of the resources required to compute (even approximate) solutions to an MDP as the size of the MDP (the number of state variables) increases. For example, conventional dynamic programming (DP) algorithms, such as value- or policy-iteration, scale exponentially with the size, even when combined with sophisticated multigrid algorithms [4]. Moreover, the curse of dimensionality is not specific to any particular algorithm, as shown by a result of Chow and Tsitsiklis [3].

Recently, Kearns et al. have shown that a certain on-line, tree-building algorithm avoids the curse of dimensionality in discounted MDPs [9], and the same authors have since extended this result to partially observable MDPs (POMDPs) [8]. The bounds in these two papers are independent of the size of the state space, but scale exponentially with the effective horizon-time 1/(1 − γ), where γ is the discount factor of the MDP. In this paper we consider another on-line planning algorithm that will be shown to scale polynomially with the horizon-time as well. The price of this is that we have to assume more regularity of the MDPs considered. In particular, we restrict ourselves to stochastic MDPs with finite action spaces and state space X = [0, 1]^d, and, more importantly, we assume that the transition probability kernels of the MDPs satisfy the Lipschitz condition |p(x′|x1, a) − p(x′|x2, a)| ≤ Lp ‖x1 − x2‖1 for all states x1, x2, x′ ∈ [0, 1]^d and actions a ∈ A, where Lp > 0 is a given fixed number and ‖·‖1 denotes the ℓ1 norm of vectors. Another restriction (quite common in the literature) is the uniform boundedness of the transition probabilities (with bound denoted by Kp) and of the immediate rewards (with bound denoted by Kr). Further, our bounds will depend on the dimension d of the state space.

The idea of the algorithms considered here originates in the algorithm studied by Rust [13]. Rust studied a more restricted class of problems than the one considered in this paper and proved the following result. First, let us define the concept of ε-optimality in the mean. Fix an MDP with state space X. A random, real-valued function V̂ with domain X is called ε-optimal in the mean if E[‖V̂ − V∗‖∞] ≤ ε, where V∗ is the optimal value function of the selected MDP, ‖·‖∞ is the maximum norm, and the expectation is taken over the randomness of V̂.
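As a concrete illustration of the kind of algorithm this line of work builds on, here is a minimal sketch in the spirit of Rust's randomised value iteration [13]: ordinary value iteration restricted to N states drawn uniformly from the state space, with the continuous transition density self-normalised over the sample. The toy one-dimensional density p, reward r, and all constants below are invented for illustration; they merely satisfy the boundedness and Lipschitz assumptions stated above and are not taken from the paper.

```python
# Sketch of a randomised (sample-based) Bellman operator in the spirit of Rust [13]:
# value iteration over N uniformly sampled states, with the continuous transition
# density re-weighted (self-normalised) across the sample. Toy 1-D MDP, d = 1.
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9            # discount factor
N = 200                # number of sampled base states
actions = [0, 1]       # finite action set A

def p(x_next, x, a):
    """Bounded, Lipschitz transition density on [0, 1] (toy example, integrates to 1)."""
    center = np.clip(x + (0.1 if a == 1 else -0.1), 0.0, 1.0)
    return 1.0 + 0.5 * np.cos(2 * np.pi * (x_next - center))

def r(x, a):
    """Bounded immediate reward (toy example)."""
    return 1.0 - abs(x - 0.5) + 0.1 * a

# Base points: N states drawn uniformly from X = [0, 1].
xs = rng.uniform(0.0, 1.0, size=N)

# Precompute self-normalised transition weights P[a][i, j] ~ p(x_j | x_i, a).
P = {}
for a in actions:
    W = np.array([[p(xj, xi, a) for xj in xs] for xi in xs])
    P[a] = W / W.sum(axis=1, keepdims=True)   # each row sums to 1

R = {a: np.array([r(xi, a) for xi in xs]) for a in actions}

# Value iteration with the sampled Bellman operator on the base points.
V = np.zeros(N)
for _ in range(200):
    V = np.max([R[a] + gamma * P[a] @ V for a in actions], axis=0)

print("approximate V* at the first sampled states:", V[:5])
```

Roughly speaking, in such a scheme the accuracy is governed by the number of sampled states N (together with Kp, Kr, Lp and the horizon-time) rather than by the resolution of a grid, which is what makes statements of the ε-optimality-in-the-mean type possible without an explicit discretisation of the state space.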

Similar Articles

Computing Near Optimal Strategies for Stochastic Investment Planning Problems

We present efficient techniques for computing near optimal strategies for a class of stochastic commodity trading problems modeled as Markov decision processes (MDPs). The process has a continuous state space and a large action space and cannot be solved efficiently by standard dynamic programming methods. We exploit structural properties of the process, and combine it with Monte-Carlo estimatio...

Bandit Based Monte-Carlo Planning

For large state-space Markovian Decision Problems, Monte-Carlo planning is one of the few viable approaches to find near-optimal solutions. In this paper we introduce a new algorithm, UCT, that applies bandit ideas to guide Monte-Carlo planning. In finite-horizon or discounted MDPs the algorithm is shown to be consistent, and finite-sample bounds are derived on the estimation error due to sampling...
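As a rough illustration of the "bandit ideas" mentioned above, the sketch below shows a UCB1-style selection rule of the kind UCT applies at each tree node to decide which action to simulate next. The exploration constant and the toy two-action example are illustrative assumptions, not taken from the paper.

```python
# UCB1-style action selection as used to guide Monte-Carlo planning at a node.
import math
import random

def ucb1_select(counts, means, c=math.sqrt(2)):
    """Return the arm maximising mean + c*sqrt(ln(total)/count); try unvisited arms first."""
    total = sum(counts)
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(math.log(total) / counts[i]))

# Toy usage: two actions whose simulated (rollout) returns have different means.
random.seed(0)
true_means = [0.4, 0.6]
counts, means = [0, 0], [0.0, 0.0]
for _ in range(1000):
    a = ucb1_select(counts, means)
    reward = random.random() < true_means[a]        # Bernoulli payoff from a rollout
    counts[a] += 1
    means[a] += (reward - means[a]) / counts[a]     # incremental mean update
print(counts)   # the better action (index 1) receives most of the simulations
```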

Efficient Approximate Value Iteration for Continuous Gaussian POMDPs

We introduce a highly efficient method for solving continuous partially-observable Markov decision processes (POMDPs) in which beliefs can be modeled using Gaussian distributions over the state space. Our method enables fast solutions to sequential decision making under uncertainty for a variety of problems involving noisy or incomplete observations and stochastic actions. We present an efficie...
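For context on the belief representation mentioned above, the following is a minimal sketch, assuming a linear-Gaussian model: the belief over the continuous state is a (mean, covariance) pair propagated by a standard Kalman step. The model matrices and noise covariances are invented for illustration and are not the paper's.

```python
# Gaussian belief over a continuous state, updated with a linear-Gaussian (Kalman) step.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state transition (toy 2-D model)
B = np.array([[0.0], [0.1]])             # control input matrix
C = np.array([[1.0, 0.0]])               # only the first coordinate is observed
Q = 0.01 * np.eye(2)                     # process noise covariance
R = np.array([[0.05]])                   # observation noise covariance

def belief_update(mean, cov, u, z):
    """One Kalman step: predict with control u, then correct with observation z."""
    mean_pred = A @ mean + B @ u
    cov_pred = A @ cov @ A.T + Q
    S = C @ cov_pred @ C.T + R                    # innovation covariance
    K = cov_pred @ C.T @ np.linalg.inv(S)         # Kalman gain
    mean_new = mean_pred + K @ (z - C @ mean_pred)
    cov_new = (np.eye(2) - K @ C) @ cov_pred
    return mean_new, cov_new

# Toy usage: start uncertain, apply one control and one noisy observation.
mean, cov = np.zeros(2), np.eye(2)
mean, cov = belief_update(mean, cov, u=np.array([1.0]), z=np.array([0.3]))
print(mean, np.diag(cov))
```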

ReTrASE: Integrating Paradigms for Approximate Probabilistic Planning

Past approaches for solving MDPs have several weaknesses: 1) Decision-theoretic computation over the state space can yield optimal results but scales poorly. 2) Value-function approximation typically requires human-specified basis functions and has not been shown successful on nominal (“discrete”) domains such as those in the ICAPS planning competitions. 3) Replanning by applying a classical pl...

Dynamic Non-uniform Abstractions for Approximate Planning in Large Structured Stochastic Domains

The theory of Markov Decision Processes (MDPs) provides algorithms for generating an optimal policy. For large domains these algorithms become intractable and approximate solutions become necessary. In this paper we extend previous work on approximate planning in large stochastic domains by using automatically-generated non-uniform abstractions which exploit the structure of the state space. We...

Journal:
  • AI Commun.

Volume 14  Issue -

Pages -

Publication date 2001